116 research outputs found

    Federated Survival Forests

    Survival analysis is a subfield of statistics concerned with modeling the occurrence time of a particular event of interest for a population. Survival analysis has found widespread applications in healthcare, engineering, and social sciences. However, real-world applications involve survival datasets that are distributed, incomplete, censored, and confidential. In this context, federated learning can tremendously improve the performance of survival analysis applications. Federated learning provides a set of privacy-preserving techniques to jointly train machine learning models on multiple datasets without compromising user privacy, leading to better generalization performance. However, despite the widespread development of federated learning in recent AI research, few studies focus on federated survival analysis. In this work, we present a novel federated algorithm for survival analysis based on one of the most successful survival models, the random survival forest. We call the proposed method Federated Survival Forest (FedSurF). With a single communication round, FedSurF obtains a discriminative power comparable to deep-learning-based federated models trained over hundreds of federated iterations. Moreover, FedSurF retains all the advantages of random forests, namely low computational cost and natural handling of missing values and incomplete datasets. These advantages are especially desirable in real-world federated environments with multiple small datasets stored on devices with low computational capabilities. Numerical experiments compare FedSurF with state-of-the-art survival models in federated networks, showing how FedSurF outperforms deep-learning-based federated algorithms in realistic environments with non-identically distributed data.

    Scaling Survival Analysis in Healthcare with Federated Survival Forests: A Comparative Study on Heart Failure and Breast Cancer Genomics

    Survival analysis is a fundamental tool in medicine, modeling the time until an event of interest occurs in a population. However, in real-world applications, survival data are often incomplete, censored, distributed, and confidential, especially in healthcare settings where privacy is critical. The scarcity of data can severely limit the scalability of survival models to distributed applications that rely on large data pools. Federated learning is a promising technique that enables machine learning models to be trained on multiple datasets without compromising user privacy, making it particularly well-suited for addressing the challenges of survival data and large-scale survival applications. Despite significant developments in federated learning for classification and regression, many directions remain unexplored in the context of survival analysis. In this work, we propose an extension of the Federated Survival Forest algorithm, called FedSurF++. This federated ensemble method constructs random survival forests in heterogeneous federations. Specifically, we investigate several new tree sampling methods from client forests and compare the results with state-of-the-art survival models based on neural networks. The key advantage of FedSurF++ is its ability to achieve comparable performance to existing methods while requiring only a single communication round to complete. The extensive empirical investigation results in a significant improvement from the algorithmic and privacy preservation perspectives, making the original FedSurF algorithm more efficient, robust, and private. We also present results on two real-world datasets demonstrating the success of FedSurF++ in real-world healthcare studies. Our results underscore the potential of FedSurF++ to improve the scalability and effectiveness of survival analysis in distributed settings while preserving user privacy
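    The abstract above mentions sampling trees from client forests to build a federated ensemble in a single communication round, but it does not state the sampling rules. The following is only a minimal sketch of one plausible rule, assuming each client's chance of contributing a tree is proportional to its local sample count. The function name `sample_trees`, its parameters, and the weighting scheme are hypothetical and are not taken from the FedSurF++ paper.

```python
import random
from typing import Dict, List, Sequence


def sample_trees(
    client_forests: Dict[str, Sequence[object]],
    client_sizes: Dict[str, int],
    total_trees: int,
    seed: int = 0,
) -> List[object]:
    """Build a server-side ensemble by sampling trees from client forests.

    Assumption (not from the paper): a client is picked with probability
    proportional to its number of local samples, then one of its trees is
    drawn uniformly at random, with replacement.
    """
    rng = random.Random(seed)
    clients = sorted(client_forests)  # fixed order for reproducibility
    weights = [client_sizes[c] for c in clients]
    ensemble = []
    for _ in range(total_trees):
        chosen_client = rng.choices(clients, weights=weights, k=1)[0]
        ensemble.append(rng.choice(client_forests[chosen_client]))
    return ensemble


# Toy example: strings stand in for locally trained survival trees.
forests = {"hospital_a": ["a0", "a1", "a2"], "hospital_b": ["b0", "b1"]}
sizes = {"hospital_a": 300, "hospital_b": 100}
print(sample_trees(forests, sizes, total_trees=4))
```

    In this sketch each client uploads its forest once and the server only selects trees, which is what keeps the procedure to a single round.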

    Heterogeneous Datasets for Federated Survival Analysis Simulation

    This repo contains three algorithms for constructing realistic federated datasets for survival analysis. Each algorithm starts from an existing non-federated dataset and assigns each sample to a specific client in the federation. The algorithms are:

    uniform_split: assigns each sample to a random client with uniform probability;
    quantity_skewed_split: assigns each sample to a random client according to the Dirichlet distribution [3, 4];
    label_skewed_split: assigns each sample to a time bin, then assigns a set of samples from each bin to the clients according to the Dirichlet distribution [3, 4].

    For more information, please take a look at our paper at https://arxiv.org/abs/2301.12166 [1].

    Content

    federated_survival_datasets.zip: the content of the repository at https://github.com/archettialberto/federated_survival_datasets
    Heterogheneous_Datasets_for_Federated_Survival_Analysis_Simulation.pdf: the conference paper describing the work.

    Installation

    Federated Survival Datasets is built on top of numpy and scikit-learn. To install those libraries you can run pip install -r requirements.txt. To import survival datasets into your project, we strongly recommend SurvSet (https://github.com/ErikinBC/SurvSet) [2], a comprehensive collection of more than 70 survival datasets.

    Usage

        import numpy as np
        import pandas as pd
        from federated_survival_datasets import label_skewed_split

        # import a survival dataset and extract the input array X and the output array y
        df = pd.read_csv("metabric.csv")
        X = df[[f"x{i}" for i in range(9)]].to_numpy()
        y = np.array(
            [(e, t) for e, t in zip(df["event"], df["time"])],
            dtype=[("event", bool), ("time", float)],
        )

        # run the splitting algorithm
        client_data = label_skewed_split(num_clients=8, X=X, y=y)

        # check the number of samples assigned to each client
        for i, (X_c, y_c) in enumerate(client_data):
            print(f"Client {i} - X: {X_c.shape}, y: {y_c.shape}")

    We provide an example notebook in the zipped folder to illustrate the proposed algorithms. It requires scikit-survival, seaborn, and pandas.

    References

    [1] Archetti, A., Lomurno, E., Lattari, F., Martin, A., & Matteucci, M. (2023). Heterogeneous Datasets for Federated Survival Analysis Simulation. arXiv preprint arXiv:2301.12166.
    [2] Drysdale, E. (2022). SurvSet: An open-source time-to-event dataset repository. arXiv preprint arXiv:2203.03094.
    [3] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
    [4] Li, Q., Diao, Y., Chen, Q., & He, B. (2022, May). Federated learning on non-iid data silos: An experimental study. In 2022 IEEE 38th International Conference on Data Engineering (ICDE) (pp. 965-978). IEEE.
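    To make the three splitting strategies described in this entry more concrete, here is a rough sketch written with numpy only. It is not the repository's implementation: the `*_sketch` function names, the `alpha` and `num_bins` parameters, and the use of quantile-based time bins are assumptions made for illustration. Each function returns one client index per sample.

```python
import numpy as np


def uniform_split_sketch(num_clients, n_samples, rng):
    """Assign each sample to a client chosen uniformly at random."""
    return rng.integers(0, num_clients, size=n_samples)


def quantity_skewed_split_sketch(num_clients, n_samples, alpha, rng):
    """Draw client proportions from Dirichlet(alpha), then assign samples."""
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    return rng.choice(num_clients, size=n_samples, p=proportions)


def label_skewed_split_sketch(num_clients, times, num_bins, alpha, rng):
    """Bin samples by time, then split each bin with its own Dirichlet draw."""
    # Assumption: quantile edges, so bins hold roughly equal sample counts.
    edges = np.quantile(times, np.linspace(0, 1, num_bins + 1)[1:-1])
    bins = np.digitize(times, edges)  # values in 0 .. num_bins - 1
    assignment = np.empty(len(times), dtype=int)
    for b in range(num_bins):
        idx = np.flatnonzero(bins == b)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        assignment[idx] = rng.choice(num_clients, size=len(idx), p=proportions)
    return assignment


# Synthetic event times stand in for a real survival dataset.
rng = np.random.default_rng(0)
times = rng.exponential(scale=10.0, size=1000)
assignment = label_skewed_split_sketch(8, times, num_bins=10, alpha=0.5, rng=rng)
print(np.bincount(assignment, minlength=8))  # samples per client
```

    Smaller values of `alpha` concentrate the Dirichlet draw on fewer clients, so lowering it in this sketch gives a more heterogeneous federation.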

    The Bi-objective Long-haul Transportation Problem on a Road Network

    In this paper, we study a long-haul truck scheduling problem where a path has to be determined for a vehicle traveling from a specified origin to a specified destination. We consider refueling decisions along the path, while accounting for heterogeneous fuel prices in a road network. Furthermore, the path has to comply with Hours of Service (HoS) regulations. Therefore, a path is defined by the actual road trajectory traveled by the vehicle, as well as the locations where the vehicle stops due to refueling, compliance with HoS regulations, or a combination of the two. This setting is cast as a bi-objective optimization problem, considering the minimization of fuel cost and the minimization of path duration. An algorithm is proposed to solve the problem on a road network. The algorithm builds a set of non-dominated paths with respect to the two objectives. Given the enormous theoretical size of the road network, the algorithm follows an interactive path construction mechanism. Specifically, the algorithm dynamically interacts with a geographic information system to identify the relevant potential paths and stop locations. Computational tests are run on real-sized instances where the distance covered ranges from 500 to 1500 km. The algorithm is compared with solutions obtained from a policy mimicking the current practice of a logistics company. The results show that the non-dominated solutions produced by the algorithm significantly dominate the ones generated by the current practice in terms of fuel costs, while achieving similar path durations. The average number of non-dominated paths is 2.7, which allows decision makers to visually inspect the proposed alternatives.
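    The notion of a non-dominated path used in the abstract above can be illustrated with a small sketch. It assumes each candidate path is summarized by a pair (fuel cost, duration), both to be minimized. The function names and the numeric values are hypothetical and only show the filtering step, not the paper's path construction algorithm.

```python
from typing import List, Tuple

Path = Tuple[float, float]  # (fuel_cost, duration), both minimized


def dominates(a: Path, b: Path) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def non_dominated(paths: List[Path]) -> List[Path]:
    """Return the Pareto front of the given paths, sorted by fuel cost."""
    unique = sorted(set(paths))
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


# Hypothetical candidates: (fuel cost in currency units, duration in hours).
candidates = [
    (420.0, 11.5),
    (400.0, 12.0),
    (450.0, 11.5),
    (410.0, 13.0),
    (430.0, 10.8),
]
print(non_dominated(candidates))
# -> [(400.0, 12.0), (420.0, 11.5), (430.0, 10.8)]
```

    Here (450.0, 11.5) is dropped because (420.0, 11.5) is cheaper at the same duration, and (410.0, 13.0) is dropped because (400.0, 12.0) is both cheaper and faster.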

    SGDE: Secure Generative Data Exchange for Cross-Silo Federated Learning

    Privacy regulation laws, such as GDPR, impose transparency and security as design pillars for data processing algorithms. In this context, federated learning is one of the most influential frameworks for privacy-preserving distributed machine learning, achieving astounding results in many natural language processing and computer vision tasks. Several federated learning frameworks employ differential privacy to prevent private data leakage to unauthorized parties and malicious attackers. Many studies, however, highlight the vulnerabilities of standard federated learning to poisoning and inference attacks, thus raising concerns about potential risks for sensitive data. To address this issue, we present SGDE, a generative data exchange protocol that improves user security and machine learning performance in a cross-silo federation. The core of SGDE is to share data generators with strong differential privacy guarantees trained on private data instead of communicating explicit gradient information. These generators synthesize an arbitrarily large amount of data that retains the distinctive features of private samples but differs substantially from them. In this work, SGDE is tested in a cross-silo federated network on image and tabular datasets, exploiting beta-variational autoencoders as data generators. The results show that including SGDE improves task accuracy and fairness, as well as resilience to the most influential attacks on federated learning.

    Towards cross-cohort estimation of cognitive decline in neurodegenerative diseases

    Heterogeneity of cohorts, in terms of inclusion criteria, design of follow-up visits and batteries of cognitive assessments, hinders any thorough comparison between them. For that reason, we build a cross-cohort model of cognitive decline that can be personalized to any patient, allowing partially or totally missing scores to be imputed. This makes it possible to compare, at an individual level, the disease progression of subjects from different cohorts, with a temporal realignment and with respect to a broader set of biomarkers.

    Differences Between Plasma and Cerebrospinal Fluid p-tau181 and p-tau231 in Early Alzheimer's Disease

    Plasma phosphorylated tau species have been recently proposed as peripheral markers of Alzheimer's disease (AD) pathology. In this cross-sectional study including 91 subjects, plasma and cerebrospinal fluid (CSF) p-tau181 and p-tau231 levels were elevated in the early symptomatic stages of AD. Plasma p-tau231 and p-tau181 were strongly related to CSF phosphorylated tau, total tau and amyloid, and exhibited a high accuracy, close to that of CSF p-tau231 and p-tau181, in identifying AD already in the early stage of the disease. The findings might support their use as diagnostic and prognostic peripheral AD biomarkers in both research and clinical settings.

    Rare mutations in SQSTM1 modify susceptibility to frontotemporal lobar degeneration

    Mutations in the gene coding for Sequestosome 1 (SQSTM1) have been genetically associated with amyotrophic lateral sclerosis (ALS) and Paget disease of bone. In the present study, we analyzed the SQSTM1 coding sequence for mutations in an extended cohort of 1,808 patients with frontotemporal lobar degeneration (FTLD), ascertained within the European Early-Onset Dementia consortium. As a control dataset, we sequenced 1,625 European control individuals and analyzed whole-exome sequence data of 2,274 German individuals (total n = 3,899). Association of rare SQSTM1 mutations was calculated in a meta-analysis of 4,332 FTLD and 10,240 control alleles. We identified 25 coding variants in FTLD patients, of which 10 have not been described previously. Fifteen mutations were absent in the control individuals (carrier frequency < 0.00026) whilst the others were rare in both patients and control individuals. When pooling all variants with a minor allele frequency < 0.01, an overall frequency of 3.2 % was calculated in patients. Rare variant association analysis between patients and controls showed no difference over the whole protein, but suggested that rare mutations clustering in the UBA domain of SQSTM1 may influence disease susceptibility by doubling the risk for FTLD (RR = 2.18 [95 % CI 1.24-3.85]; corrected p value = 0.042). Detailed histopathology demonstrated that mutations in SQSTM1 associate with widespread neuronal and glial phospho-TDP-43 pathology. With this study, we provide further evidence for a putative role of rare mutations in SQSTM1 in the genetic etiology of FTLD and show that, comparable to other FTLD/ALS genes, SQSTM1 mutations are associated with TDP-43 pathology.